Replace kl_penalty_reference_step with kl_penalty_step_lag #625

Open
angkywilliam wants to merge 13 commits into feat/pipeline-kl from feat/pipeline-kl-step-lag

Collaborator

angkywilliam commented Mar 20, 2026

Summary

  • Rename kl_penalty_reference_step to kl_penalty_step_lag in PipelineTrainer
  • kl_penalty_step_lag=None (default): uses step 0 as the KL reference (anchors to the initial model)
  • kl_penalty_step_lag >= 1: uses max(0, current_step - lag) as the reference step (rolling anchor)
  • Add validation that kl_penalty_step_lag, if specified, must be >= 1
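
The reference-step rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `validate_step_lag` and `kl_reference_step` are hypothetical, and the actual logic lives inside PipelineTrainer, whose internals are not shown here.

```python
from typing import Optional


def validate_step_lag(kl_penalty_step_lag: Optional[int]) -> None:
    """Reject lags below 1; None is allowed (hypothetical helper)."""
    if kl_penalty_step_lag is not None and kl_penalty_step_lag < 1:
        raise ValueError(
            f"kl_penalty_step_lag must be >= 1 if specified, got {kl_penalty_step_lag}"
        )


def kl_reference_step(current_step: int, kl_penalty_step_lag: Optional[int]) -> int:
    """Return the step whose checkpoint serves as the KL reference (hypothetical helper)."""
    validate_step_lag(kl_penalty_step_lag)
    if kl_penalty_step_lag is None:
        # Default: anchor to the initial model.
        return 0
    # Rolling anchor: look back `lag` steps, clamped at step 0.
    return max(0, current_step - kl_penalty_step_lag)


if __name__ == "__main__":
    for step in (0, 2, 10):
        print(step, kl_reference_step(step, None), kl_reference_step(step, 3))
```

With a lag of 3, the reference stays at step 0 until the trainer has passed step 3, after which it trails the current step by exactly 3.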

Test plan

  • Unit tests pass (test_pipeline_trainer_local_backend.py)
  • New tests for lag computation added
  • Integration tests (requires 2 GPUs)

🤖 Generated with Claude Code

Add tinker variant of KL-penalized advantage training script and
align model naming conventions (backend-random-coef) across both.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>